Computer Generation of Fourier Transform Libraries for Distributed Memory Architectures
Abstract
High-performance discrete Fourier transform (DFT) libraries are an important requirement for many computing platforms. Unfortunately, developing and optimizing these libraries for modern, complex platforms has become extraordinarily difficult. To make things worse, performance often does not port, thus requiring continual re-optimization. Overcoming this problem has been the goal of SPIRAL, a library generation system that can produce highly optimized DFT code from a high-level specification of algorithms and target platforms. However, current techniques in SPIRAL cannot support all target platforms. In particular, several emerging target platforms incorporate a distributed memory parallel processing paradigm, where the cost of accessing non-local memories is relatively high, and handling data movement is exposed to the programmers. Traditionally used only in supercomputing environments, this paradigm is now also finding its way into desktop computing in the form of multicore processors. The goal of this work is the computer generation of high-performance DFT libraries for a wide range of distributed memory parallel processing systems, given only a high-level description of a DFT algorithm and some platform parameters. The key challenges include generating code for multiple target programming paradigms that delivers load-balanced parallelization across multiple layers of the compute hierarchy, orchestrates explicit memory management, and overlaps computation with communication. We attack this problem by first developing a formal framework to describe parallelization, streaming, and data exchange in a domain-specific declarative mathematical language. Based on this framework, we develop a rewriting system that structurally manipulates DFT algorithms to “match” them to a distributed memory target architecture and hence extracts maximum performance. We implement this approach as a part of SPIRAL together with a backend that trans-
Related resources
High Performance Linear Transform Program Generation for the Cell BE
The Cell BE is among a new generation of multicore processors including the Intel Larrabee and the Tilera TILE64 that provide an impressive peak fixed or floating point performance for scientific, signal processing, visualization, and other engineering applications. As shown in Fig. 1, the Cell uses simple in-order cores designed specifically for numerical computing, and requires explicit memor...
Adaptive Matrix Transpose Algorithms for Distributed Multicore Processors
An adaptive parallel matrix transpose algorithm optimized for distributed multicore architectures running in a hybrid OpenMP/MPI configuration is presented. Significant boosts in speed are observed relative to the distributed transpose used in the state-of-the-art adaptive FFTW library. In some cases, a hybrid configuration allows one to reduce communication costs by reducing the number of MPI ...
AccFFT: A library for distributed-memory FFT on CPU and GPU architectures
We present a new library for parallel distributed Fast Fourier Transforms (FFT). The importance of FFT in science and engineering and the advances in high performance computing necessitate further improvements. AccFFT extends existing FFT libraries for CUDA-enabled Graphics Processing Units (GPUs) to distributed memory clusters. We use an overlapping communication method to reduce the overhead of ...
Parallelism in Spiral
Spiral is a program generator for linear transforms such as the discrete Fourier transform. Spiral generates highly optimized code directly from a problem specification using a combination of techniques including optimization at a high level of abstraction using rewriting of mathematical expressions and heuristic search for platform adaptation. In this paper, we overview the generation of paral...
An Object-Oriented BSP Library for Multicore Programming
We show that the Bulk Synchronous Parallel (BSP) model, originally designed for distributed-memory systems, is also applicable for shared-memory multicore systems and, furthermore, that BSP libraries are useful in scientific computing on these systems. A proof-of-concept MulticoreBSP library has been implemented in Java, and is used to show that BSP algorithms can attain proper speedups on multi...
Publication date: 2010